Welcome to my analysis of NYC Taxi data! I obtained the data from Chris Whong who was able to access it through a FOIL (Freedom of Information Law) request. The data set contains information on every trip taken in a NYC yellow cab in January 2013 (about 14 million observations).
My analysis is composed of three files:
* eda.Rmd (exploratory data analysis)
* individual_taxis.Rmd (analysis of individual cab drivers, or “cabbies”)
* profit_strategies.Rmd (analysis of cabbie revenue and profit strategies)
I learned all of these techniques during my summer at Cricket’s Circle- special thanks to Bryce and all who helped me along the way.
This file contains the exploratory data analysis of the data set. Essentially, this is a record of my initial exploration of this data. We’ll need some libraries to facilitate this exploration.
source('db_connection.r')
library(ggplot2)
library(tidyr)
library(dplyr)
library(lubridate)
library(rpart)
library(rpart.plot)
library(corrgram)
library(prob)
First, I’ll pull the data set from my database. My laptop can only handle a million observations at a time, so I’ll limit my data set to 1,000,000 taxi trip records.
taxi_raw <- query_results("
SELECT *
FROM trips
INNER JOIN fares ON trips.id = fares.trip_id
-- ORDER BY trips.hack_license, trips.medallion, trips.pickup_datetime
LIMIT 1000000;
")
taxi_raw <- taxi_raw[,unique(colnames(taxi_raw))]
In order to better analyze the data, let’s divide the longitude/latitude pairings into boroughs (this is just an estimation– the boundaries are far from perfect):
taxi_raw <- taxi_raw %>%
mutate(pickup_borough =
ifelse(pickup_longitude> -74.01 & pickup_longitude< -73.927 &
pickup_latitude > 40.701 & pickup_latitude < 40.875, "manhattan",
ifelse(pickup_latitude<40.7 & pickup_longitude > -73.85 |
pickup_longitude > -73.96 & pickup_longitude < -73.92 &
pickup_latitude < 40.704 & pickup_latitude >40.739, "brooklyn",
ifelse(pickup_longitude> - 73.927 | pickup_latitude > 40.875, "bronx",
ifelse(pickup_longitude < -74.04, "staten_island", "queens")))))
taxi_raw <- taxi_raw %>%
mutate(dropoff_borough =
ifelse(dropoff_longitude> -74.01 & dropoff_longitude< -73.927 &
dropoff_latitude > 40.701 & dropoff_latitude < 40.875, "manhattan",
ifelse(dropoff_latitude<40.7 & dropoff_longitude > -73.85 |
dropoff_longitude > -73.96 & dropoff_longitude < -73.92 &
dropoff_latitude < 40.704 & dropoff_latitude >40.739, "brooklyn",
ifelse(dropoff_longitude> - 73.927 | dropoff_latitude > 40.875, "bronx",
ifelse(dropoff_longitude < -74.04, "staten_island", "queens")))))
# type-casting my variables
taxi_raw <- taxi_raw %>%
mutate(
medallion = as.factor(medallion),
hack_license = as.factor(hack_license),
vendor_id = as.factor(vendor_id),
rate_code = as.factor(rate_code),
pickup_datetime = ymd_hms(pickup_datetime),
dropoff_datetime = ymd_hms(dropoff_datetime), # UTC incorrect
payment_type = as.factor(payment_type),
fare_amount = as.double(fare_amount),
surcharge = as.double(surcharge),
mta_tax = as.double(mta_tax),
tip_amount = as.double(tip_amount),
tolls_amount = as.double(tolls_amount),
total_amount = as.double(total_amount),
pickup_borough = as.factor(pickup_borough),
dropoff_borough = as.factor(dropoff_borough),
trip_time_in_secs = as.integer(trip_time_in_secs),
trip_distance = as.double(trip_distance),
passenger_count = as.integer(passenger_count)
)
Hmm… how many pickups per borough?
borough_total_pickups <- taxi_raw %>%
dplyr::filter(!is.na(pickup_borough) & !is.na(dropoff_borough)) %>%
group_by(pickup_borough) %>%
summarize(borough_pickups = n())
borough_total_pickups
## Source: local data frame [5 x 2]
##
## pickup_borough borough_pickups
## 1 bronx 27868
## 2 brooklyn 34241
## 3 manhattan 904232
## 4 queens 33168
## 5 staten_island 484
Let’s perform some EDA. How many trips per day of week?
days_of_week <- read.csv("days_of_week - Sheet1 (1).csv")
taxi <- taxi_raw %>%
dplyr::mutate(tip_percentage = tip_amount/fare_amount*100) %>%
left_join(borough_total_pickups, by = "pickup_borough") %>%
tidyr::separate(pickup_datetime, c("pickup_date", "pickup_time"), " ") %>%
tidyr::separate(pickup_time, c("pickup_hour", "pickup_min", "pickup_sec"), ":") %>%
left_join(days_of_week, by = "pickup_date") %>%
mutate(pickup_date = ymd(pickup_date)) %>%
mutate(pickup_hour = as.integer(pickup_hour)) %>%
mutate(pickup_min = as.integer(pickup_min)) %>%
mutate(pickup_sec = as.integer(pickup_sec))
taxi %>%
group_by(weekday) %>%
summarize(total_rides = n())
## Source: local data frame [7 x 2]
##
## weekday total_rides
## 1 Friday 141082
## 2 Monday 114743
## 3 Saturday 138545
## 4 Sunday 120788
## 5 Thursday 168967
## 6 Tuesday 156663
## 7 Wednesday 159212
week_levels <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
ggplot(aes(x = factor(weekday, week_levels)), data = taxi) + geom_histogram() + xlab("Day of Week") +
ylab("Number of Trips") + ggtitle("Trips Per Day")
How many trips per date? We can see here the lull on January 21st (MLK Day). There’s also a dip on Janurary 6th (Sunday), which is the Epiphany, but I’m not convinced that’s why we see less taxi traffic. Also puzzling is the peak on January 26th (Saturday).
jan_taxi <- taxi %>%
group_by(pickup_date) %>%
summarize(trips = n())
ggplot(jan_taxi, aes(x = pickup_date, y = trips, group = 1)) +
geom_line(stat= 'identity') + ylab("Number of Trips") +
xlab("Pickup Date") + ggtitle("Number of Daily Trips in January 2013")
Total revenue each day of the week?
It appears that there are, in fact, differences in revenue throughout the week. Keep in mind that our sample contains just four observations of each day (totaling 1 month). I’m particularly interested in Monday and Friday, which seem to be the lowest and highest revenue days, respecively. In order to estimate significance, we can perform a t-test between Monday and Friday and between each of these days and Saturday, which looks to be about the middle.
# new data frame for each day of the week we're analyzing
taxi_days_M <- taxi_days %>%
dplyr::filter(weekday == "Monday")
taxi_days_F <- taxi_days %>%
dplyr::filter(weekday == "Friday")
taxi_days_S <- taxi_days %>%
dplyr::filter(weekday == "Saturday")
t.test(taxi_days_M$revenue, taxi_days_F$revenue)
##
## Welch Two Sample t-test
##
## data: taxi_days_M$revenue and taxi_days_F$revenue
## t = -3.5395, df = 5.6564, p-value = 0.01349
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -148113.19 -25970.34
## sample estimates:
## mean of x mean of y
## 404397.1 491438.9
t.test(taxi_days_M$revenue, taxi_days_S$revenue)
##
## Welch Two Sample t-test
##
## data: taxi_days_M$revenue and taxi_days_S$revenue
## t = -2.7849, df = 5.9544, p-value = 0.03205
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -107188.85 -6824.68
## sample estimates:
## mean of x mean of y
## 404397.1 461403.9
t.test(taxi_days_F$revenue, taxi_days_S$revenue)
##
## Welch Two Sample t-test
##
## data: taxi_days_F$revenue and taxi_days_S$revenue
## t = 1.2601, df = 5.4206, p-value = 0.2591
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -29831.08 89901.07
## sample estimates:
## mean of x mean of y
## 491438.9 461403.9
Because the p-value of the t-test for Friday/Saturday is >.05, we can conclude that the revenue on those two days is not statistically significant. However, both the Monday/Friday and Monday/Saturday tests are significant, indicating that the Monday revenue is not the same as the Friday and Saturday revenues.
Now I’ll explore the toll situation in the city. Looks like high tolls for Staten Island!
taxi %>%
dplyr::filter(tolls_amount>0) %>%
group_by(pickup_borough, dropoff_borough) %>%
summarize(num_rides_w_tolls = n(), average_toll = mean(tolls_amount)) %>%
ungroup() %>%
ungroup() %>%
arrange(desc(average_toll)) # can also order by number of toll rides
## Source: local data frame [25 x 4]
##
## pickup_borough dropoff_borough num_rides_w_tolls average_toll
## 1 bronx staten_island 38 11.742105
## 2 manhattan staten_island 1090 11.302046
## 3 brooklyn staten_island 60 11.147500
## 4 queens staten_island 98 10.745408
## 5 staten_island brooklyn 11 10.540909
## 6 staten_island manhattan 3 10.366667
## 7 staten_island staten_island 42 9.668571
## 8 staten_island queens 3 9.350000
## 9 brooklyn queens 118 6.250000
## 10 staten_island bronx 4 6.162500
## .. ... ... ... ...
How about trip length?
taxi_length <- taxi %>%
select(trip_distance, trip_time_in_secs) %>%
dplyr::filter(trip_distance>0)
taxi_length$trip_distance <- round(taxi_length$trip_distance)
taxi_length$trip_time_in_secs <- round(taxi_length$trip_time_in_secs, digits = -1)
ggplot(taxi_length, aes(x=trip_distance, y= trip_time_in_secs)) +
geom_point(shape=19, alpha=1/50) + xlim(0,30) + xlab("Trip Distance") +
ylab("Trip Time In Seconds") + ggtitle("Trip Time vs. Distance")
summary(taxi_length)
## trip_distance trip_time_in_secs
## Min. : 0.000 Min. : 0.0
## 1st Qu.: 1.000 1st Qu.: 360.0
## Median : 2.000 Median : 560.0
## Mean : 2.797 Mean : 686.1
## 3rd Qu.: 3.000 3rd Qu.: 890.0
## Max. :100.000 Max. :9970.0
Looks like the vast majority of trips are within 25 miles and 50 minutes. Mean distance is 3.39 miles, mean time is about 13 minutes (802 seconds).
Can we find any trends with trips to airports? Perhaps there’s a flat rate fare. Here’s a histogram of fare amounts (not counting surcharges, tip, tolls, etc.).
ggplot(taxi, aes(x = fare_amount)) + geom_histogram(binwidth = 1) + xlim(0,57) +
xlab("Fare Amount") + ylab("Number of Trips") + ggtitle("Trip Fares")
Notice the peak at $52. Perhaps this is a flat rate?
taxi52 <- taxi %>%
dplyr::filter(fare_amount == 52)
ggplot(taxi52, aes(x = dropoff_longitude, y = dropoff_latitude)) + geom_point(alpha = 1/10) +
xlim(-74.2,-73.6) + ylim(40.5,41) + xlab("Dropoff Longitude") + ylab("Dropoff Latitude") +
ggtitle("$52 Fare Dropoff Locations")
Here we see two clusters. Let’s examine more closely.
# Checking out the cluster to the right
ggplot(taxi52, aes(x = pickup_longitude, y = pickup_latitude)) + geom_point(alpha = 1/10) +
xlim(-74.05,-73.75)+ ylim(40.5,41) + xlab("Dropoff Longitude") + ylab("Dropoff Latitude") +
ggtitle("Right Cluster of $52 Dropoff Locations")
And would you look at that! The cluster to the right is at approximately -73.78 longitude and 40.645 latitude, landing us right on top of JFK airport. If we look at the pickup data for the other point in the dropoff graph, we find that these customers were picked up at the airport and brough to various places in NYC. The internet confirms that this ride is a flat rate either way.
What if we look at general pickup/dropoff locations? Will we find any especially popular spots other than the airport?
ggplot(taxi, aes( x = pickup_longitude, y = pickup_latitude)) + geom_point(alpha = 1/500) +
xlim(-74.01,-73.85) + ylim(40.7,40.85) + theme_bw() + xlab("Pickup Longitude") +
ylab("Pickup Latitude") + ggtitle("Pickup Locations")
ggplot(taxi, aes( x = dropoff_longitude, y = dropoff_latitude)) + geom_point(alpha = 1/500) + xlim(-74.01,-73.85) + ylim(40.7,40.85) + theme_bw() + xlab("Dropoff Longitude") +
ylab("Dropoff Latitude") + ggtitle("Dropoff Locations")
By zooming in on this graph and identifying landmarks based on longitude/latitude data, I was able to identify the following as popular pick up and drop off locations: * LaGuardia Airport * Times Square * Penn Station/MSG * Columbus Circle * Car Inspection Station in Brooklyn * Dial 7 car/limo service in Queens * Brooklyn Driving School * Various auto shops
Let’s look at tipping. This data set only contains tip data for rides paid with cards, so we’ll limit our data set to those trips only.
tip_analysis <- taxi %>%
dplyr::filter(payment_type=="CRD") %>%
dplyr::filter(tip_percentage<=100)
ggplot(data = tip_analysis, aes(x = tip_percentage)) + geom_histogram() + xlim(0,50) +
xlab("Tip Percentage") + ylab("Count") + ggtitle("Tip Percentages")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
ggplot(data = tip_analysis, aes(x = fare_amount, y = tip_percentage)) +
geom_point(alpha = 1/10) + xlim(0,100) + xlab("Fare Amount") + ylab("Tip Percentage") +
ggtitle("Fare Amount vs. Tip Percentage")
The bottom graph yields 2 main conclusions: 1. Customers are using the buttons on the card payment screen for 20, 25, and 30% tips (these are the horizontal lines in the graph). 2. Customers pay a smaller percentage when the rate is high (i.e. If I’m already spending $50 on a taxi, I don’t want to spend another $15 on the tip).
We can also see the vertical line at the $52 fare (many different tips for airport trips).
Let’s run a correlogram to see if there are any surprising relationships between our variables.
taxi_num <- taxi %>%
dplyr::mutate(tip_percentage = tip_amount/fare_amount*100) %>%
select(vendor_id, rate_code, passenger_count, trip_time_in_secs, trip_distance,
pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude,
fare_amount, surcharge, mta_tax, tip_amount, tolls_amount, total_amount,
tip_percentage)
corrgram(taxi_num, order = TRUE, upper.panel = panel.pie, main = "Trip Correlations")
Nothing too interesting here… the fare, trip time, trip distance, tip amount total amount, etc. are all positively correlated. The pickup longitude and latitude have an inverse relationship, which makes sense since in NYC longitudes are negative while latitudes are positive. Correlograms have been helpful in analyses of other data sets, but here this method just illuminates the lack of surprising correlations in our data.
Let’s take a look at trip efficiency. Here I’ll define efficiency as trip distance/trip time (this will give us the average mph of the trip).
taxi_efficiency <- taxi %>%
dplyr::filter(trip_distance>0, trip_time_in_secs>0) %>%
mutate(mph = trip_distance/(trip_time_in_secs/3600)) %>%
arrange(mph)
head(taxi_efficiency)
## medallion hack_license
## 1 1D63D831583C751816DFB6BC7B910BC9 9C2E1C0BB311E30ABE89F4F13D1300D2
## 2 9F3B5F7A9F37581A5B033F4D2050C935 FE18CF22768A331E293E0D48F5821F27
## 3 2DF9CA4377B28B8963C9BD57286F7FEF 9FABAFB1B26E6462B3E309DF298749D8
## 4 66764B77F96DF43315499D73AA6DD84E 34241B14FE61FD74489E45B09E833F90
## 5 50F66AF7213A5511FDDAEEBB99C04FF4 51A03E200C82F7CA6A7273D7686103F5
## 6 ED0720EE1523446514917DAC38276C46 F8015807336B9656E0D199C6D5023917
## vendor_id rate_code store_and_fwd_flag pickup_date pickup_hour
## 1 VTS 1 <NA> 2013-01-21 18
## 2 VTS 1 <NA> 2013-01-12 21
## 3 VTS 1 <NA> 2013-01-20 1
## 4 VTS 1 <NA> 2013-01-03 8
## 5 CMT 1 N 2013-01-12 5
## 6 VTS 1 <NA> 2013-01-03 20
## pickup_min pickup_sec dropoff_datetime passenger_count
## 1 20 0 2013-01-21 19:10:00 1
## 2 12 0 2013-01-12 21:28:00 1
## 3 43 0 2013-01-20 01:57:00 1
## 4 54 0 2013-01-03 09:06:00 6
## 5 24 54 2013-01-12 07:07:54 1
## 6 57 0 2013-01-03 21:07:00 5
## trip_time_in_secs trip_distance pickup_longitude pickup_latitude
## 1 3000 0.01 -73.99160 40.75033
## 2 960 0.01 -73.98907 40.73432
## 3 840 0.01 -73.98151 40.77268
## 4 720 0.01 -73.99077 40.75611
## 5 6180 0.10 -73.97388 40.76351
## 6 600 0.01 -74.13508 40.83564
## dropoff_longitude dropoff_latitude
## 1 -73.99181 40.74995
## 2 -74.00905 40.71050
## 3 -73.98133 40.77256
## 4 -73.98252 40.75737
## 5 -73.94604 40.75315
## 6 -74.13663 40.84065
## id payment_type fare_amount
## 1 \\x8d2550e8bba0c5116c2b4393c0d2aefd688163f0 CSH 27.5
## 2 \\x084f737f9e878d7c6d56cd6d1f86ddbb30c6976b CSH 10.0
## 3 \\x85c54d7dd9fa2ee7d64488e28b5b15c093211c6b CSH 9.0
## 4 \\xea8111027579d0186d82aacb2df81555e93d03cd CSH 8.5
## 5 \\x53eed971144d63e607f6c0fd546c5c21dd7f1337 CSH 3.0
## 6 \\xce2e64500fefeea291f90eace9ff74a9fb345b30 CSH 7.0
## surcharge mta_tax tip_amount tolls_amount total_amount
## 1 0.0 0.5 0 0 28.0
## 2 0.5 0.5 0 0 11.0
## 3 0.5 0.5 0 0 10.0
## 4 0.0 0.5 0 0 9.0
## 5 0.0 0.5 0 0 3.5
## 6 0.5 0.5 0 0 8.0
## trip_id pickup_borough
## 1 \\x8d2550e8bba0c5116c2b4393c0d2aefd688163f0 manhattan
## 2 \\x084f737f9e878d7c6d56cd6d1f86ddbb30c6976b manhattan
## 3 \\x85c54d7dd9fa2ee7d64488e28b5b15c093211c6b manhattan
## 4 \\xea8111027579d0186d82aacb2df81555e93d03cd manhattan
## 5 \\x53eed971144d63e607f6c0fd546c5c21dd7f1337 manhattan
## 6 \\xce2e64500fefeea291f90eace9ff74a9fb345b30 staten_island
## dropoff_borough tip_percentage borough_pickups weekday mph
## 1 manhattan 0 904232 Monday 0.01200000
## 2 manhattan 0 904232 Saturday 0.03750000
## 3 manhattan 0 904232 Sunday 0.04285714
## 4 manhattan 0 904232 Thursday 0.05000000
## 5 manhattan 0 904232 Saturday 0.05825243
## 6 staten_island 0 484 Thursday 0.06000000
ggplot(data = taxi_efficiency, aes(x = pickup_hour, y = mph)) + geom_point(alpha = 1/510) +
ylim(0,30) + geom_smooth(se = F) + xlab("Pickup Hour") + ylab("Trip Efficiency (mph)") +
ggtitle("Trip Efficiency Throughout Day")
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
Looks like the efficiency peaks at around 4:30am but remains pretty consistent from 7am-7pm. We see higher efficiency outside of work hours, as expected.
I did the following to see if I could identify any patterns in the locations of inefficient trips. Pickups are in blue and dropoffs are in red. Looks like they can happen anywhere!
taxi_inefficient <- taxi_efficiency %>%
dplyr::filter(mph< 2, mph >.01, pickup_longitude!=0, pickup_latitude!=0)
ggplot(data = taxi_inefficient) +
geom_point(aes(y = pickup_latitude, x = pickup_longitude),color = "blue", alpha = 1/25) +
xlim(-74.02,-73.94) + ylim(40.7,40.8) +
geom_point(aes(x = dropoff_longitude, y = dropoff_latitude), color = "red", alpha = 1/25) +
xlab("Longitude") + ylab("Latitude") + ggtitle("Inefficient Pickups and Dropoffs")
Perhaps we can use machine learning to predict the length of a trip.
efficient_ml <- taxi_efficiency %>%
mutate(trip_time_in_mins = trip_time_in_secs/60) %>%
dplyr::filter(dropoff_latitude<41, pickup_latitude<41) %>%
select(rate_code, pickup_date, pickup_hour, passenger_count, pickup_longitude, pickup_latitude,
dropoff_latitude, dropoff_longitude, pickup_borough, dropoff_borough, payment_type,
surcharge, tolls_amount,weekday, trip_time_in_mins)
fit <- rpart(trip_time_in_mins ~ ., data = efficient_ml, method = "anova")
prp(fit)
This decision tree illustrates some factors that can affect trip length. Here’s a nicer version: